Materials data science: descriptors and machine learning¶
Welcome to the materials data science lesson. In this session, we will demonstrate how to use matminer, automatminer, pandas, and scikit-learn for machine learning of materials properties.
The lesson is split into four sections:
1. Data retrieval and basic analysis of pandas DataFrame objects.
2. Generating machine-learnable descriptors.
3. Training, testing and visualizing machine learning methods with scikit-learn and FigRecipes.
4. Automating steps 2 and 3 using automatminer.
Many more tutorials on how to use matminer (beyond the scope of this workshop) are available in the matminer_examples repository.
Machine learning workflow¶
Firstly, what does a typical machine learning workflow look like? The overall process can be summarized as:
1. Take raw inputs, such as a list of compositions, and an associated target property to learn.
2. Convert the raw inputs into descriptors or features that can be learned by machine learning algorithms.
3. Train a machine learning model on the data.
4. Plot and analyze the performance of the model.
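As a sketch of what these four steps look like in code, here is a minimal end-to-end example using scikit-learn on synthetic data (the arrays below are randomly generated stand-ins, not real materials data):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Steps 1 and 2: raw inputs converted to descriptors -- here we fake
# 200 "materials", each described by 4 synthetic features
rng = np.random.RandomState(0)
X = rng.rand(200, 4)
y = X @ np.array([1.0, 2.0, 0.5, -1.0])  # target property (a known function, for illustration)

# Step 3: train a model on a train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

# Step 4: evaluate the model on held-out data
mae = mean_absolute_error(y_test, model.predict(X_test))
print(f"test MAE: {mae:.3f}")
```

The rest of this lesson replaces the synthetic arrays above with real materials data (step 1) and matminer descriptors (step 2).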
Typically, questions asked by a new practitioner in the field include:
- Where do we get the raw data from?
- How do we convert the raw data into learnable features?
- How can we plot and interpret the results of a model?
The `matminer` package has been developed to help make machine learning of materials properties easy and hassle-free. The aim of matminer is to connect materials data with data mining algorithms and data visualization.
Part 1: Data retrieval and filtering¶
Matminer interfaces with many materials databases, including:
- Materials Project
- Citrine
- AFLOW
- Materials Data Facility (MDF)
- Materials Platform for Data Science (MPDS)
In addition, matminer includes datasets from the published literature: a repository of 26 (and growing) datasets that come from peer-reviewed machine learning investigations of materials properties or from high-throughput computing studies.
In this section, we will show how to access and manipulate the datasets from the published literature. More information on accessing other materials databases is detailed in the matminer_examples repository.
A list of the literature-based datasets can be printed using the get_available_datasets() function. This also prints information about what the dataset contains, such as the number of samples, the target properties, and how the data was obtained (e.g., via theory or experiment).
from matminer.datasets import get_available_datasets
get_available_datasets()
A dataset can be loaded using the load_dataset() function and the dataset name. To save installation space, the datasets are not automatically downloaded when matminer is installed. Instead, the first time a dataset is loaded, it will be downloaded from the internet and stored in the matminer installation directory. Let's load the dielectric_constant dataset. It contains 1,056 structures with dielectric properties calculated with DFPT-PBE.
from matminer.datasets import load_dataset
df = load_dataset("dielectric_constant")
Manipulating and examining pandas DataFrame objects¶
The datasets are loaded as pandas DataFrame objects. You can think of these as a type of "spreadsheet" object in Python. DataFrames have several useful methods you can use to explore and clean the data, some of which we'll explore below.
Inspecting the dataset¶
The head() function prints a summary of the first few rows of a dataset. You can scroll across to see more columns. From this, it is easy to see the types of data available in the dataset.
df.head()
The full list of columns in the dataset is given by the columns attribute:
df.columns
Every DataFrame includes a function called describe() that helps determine statistics for the numerical columns in the data. Note that the describe() function only describes numerical columns by default. Sometimes, the describe() function will reveal outliers that indicate mistakes in the data.
df.describe()
Indexing the dataset¶
You can access a single column of the DataFrame by indexing the object using the column name. For example:
df["band_gap"]
Similarly, you can access a single row of the DataFrame using the iloc attribute:
df.iloc[100]
Filtering the dataset¶
DataFrame objects make it very easy to filter the data based on a specific column. We can use the typical Python comparison operators (==, >, >=, <, etc) to filter numerical values. For example, let's find all entries where the cell volume is greater than or equal to 580. We do this by filtering on the volume column. Note that we first produce a boolean mask – a series of True and False values depending on the comparison. We can then use the mask to filter the DataFrame.
mask = df["volume"] >= 580
df[mask]
Let's now filter the dataset to contain only semiconductors, i.e., materials with a non-zero band gap. We do this by filtering on the band_gap column.
mask = df["band_gap"] > 0
semiconductor_df = df[mask]
semiconductor_df
Columns and rows can be removed from a DataFrame using the drop() function. The function takes a list of items to drop: for columns, this is a list of column names, whereas for rows it is a list of row indices. Finally, the axis option specifies whether the data to drop is columns (axis=1) or rows (axis=0).
For example, to remove the nsites, space_group, e_electronic, and e_total columns, we can run:
cleaned_df = df.drop(["nsites", "space_group", "e_electronic", "e_total"],
axis=1)
We can examine the new DataFrame to see that the columns have been removed.
cleaned_df.head()
Generating new columns¶
Pandas DataFrame objects also make it easy to perform simple calculations on the data. Think of this as using formulas in Excel spreadsheets. All fundamental Python math operators (such as +, -, /, and *) can be used.
For example, the dielectric dataset contains the electronic contribution to the dielectric constant (\(\epsilon_\mathrm{electronic}\), in the poly_electronic column) and the total (static) dielectric constant (\(\epsilon_\mathrm{total}\), in the poly_total column). The ionic contribution is given by:
\begin{align} \epsilon_\mathrm{ionic} = \epsilon_\mathrm{total} - \epsilon_\mathrm{electronic} \end{align}
Below, we calculate the ionic contribution to the dielectric constant and store it in a new column called poly_ionic. This is as simple as assigning the data to the new column, even if the column doesn't already exist.
df["poly_ionic"] = df["poly_total"] - df["poly_electronic"]
df.head()
Part 2: Generating descriptors for machine learning¶
In this section, we will learn a bit about how to generate machine-learning descriptors from materials objects in pymatgen. First, we'll generate some descriptors with matminer's "featurizers" classes. Next, we'll use some of what we learned about dataframes in the previous section to examine our descriptors and prepare them for input to machine learning models.
### Featurizers transform materials primitives into machine-learnable features
The general idea of featurizers is that they accept a materials primitive (e.g., a pymatgen Composition) and output a vector. For example:
\begin{align} f(\mathrm{Fe}_2\mathrm{O}_3) \rightarrow [1.5, 7.8, 9.1, 0.09] \end{align}
#### Matminer contains featurizers for the following pymatgen objects:
* Composition
* Crystal structure
* Crystal sites
* Bandstructure
* Density of states
#### Depending on the featurizer, the features returned may be:
* numerical, categorical, or mixed vectors
* matrices
* other pymatgen objects (for further processing)
#### Featurizers play nice with dataframes
Since most of the time we are working with pandas dataframes, all featurizers work natively with pandas dataframes. We'll provide examples of this later in the lesson.
#### Featurizers present in matminer
Matminer hosts over 60 featurizers, most of which implement methods published in peer-reviewed papers. You can find a full list of featurizers on the [matminer website](https://hackingmaterials.lbl.gov/matminer/featurizer_summary.html). All featurizers have parallelization and convenient error tolerance built into their core methods.
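To make the idea concrete, here is a toy element-fraction featurizer written from scratch. This is only an illustrative sketch (matminer's real ElementFraction featurizer covers the whole periodic table and operates on pymatgen objects):

```python
def toy_element_fraction(composition, elements=("Fe", "O", "Si")):
    """Map a composition (a dict of element -> amount) to a fixed-length
    vector of element fractions. A simplified sketch of the featurizer idea."""
    total = sum(composition.values())
    return [composition.get(element, 0) / total for element in elements]

# f(Fe2O3) -> a machine-learnable vector: 2/5 Fe, 3/5 O, no Si
print(toy_element_fraction({"Fe": 2, "O": 3}))  # [0.4, 0.6, 0.0]
```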
In this lesson, we'll go over the main methods present in all featurizers. By the end of this unit, you will be able to generate descriptors for a wide range of materials informatics problems using one common software interface.
The featurize method and basics¶
The core method of any matminer featurizer is featurize(). This method accepts a materials object and returns a machine learning vector or matrix. Let's see an example with a pymatgen composition:
from pymatgen.core import Composition
fe2o3 = Composition("Fe2O3")
Let's featurize this composition using the ElementFraction featurizer.
from matminer.featurizers.composition import ElementFraction
ef = ElementFraction()
element_fractions = ef.featurize(fe2o3)
print(element_fractions)
The labels for each feature are listed in the Features section of the documentation of any featurizer, but a much easier way is to use the feature_labels() method.
element_fraction_labels = ef.feature_labels()
print(element_fraction_labels)
print(element_fraction_labels[7], element_fractions[7])
print(element_fraction_labels[25], element_fractions[25])
Featurizing dataframes¶
We just generated some descriptors and their labels from an individual sample but most of the time our data is in pandas dataframes. Fortunately, matminer featurizers implement a featurize_dataframe() method which interacts natively with dataframes.
Let's grab a new dataset from matminer and use our ElementFraction featurizer on it.
First, we download a dataset as we did in the previous unit. In this example, we'll download a dataset of superhard materials.
from matminer.datasets.dataset_retrieval import load_dataset
df = load_dataset("brgoch_superhard_training")
df.head()
We can use the featurize_dataframe() method (implemented by all featurizers) to apply ElementFraction to all of our data at once. The only required arguments are the input dataframe and the name of the column to featurize (in this case, composition). featurize_dataframe() is parallelized by default using multiprocessing.
df = ef.featurize_dataframe(df, "composition")
df.head()
Structure Featurizers¶
We can use the same syntax for other kinds of featurizers. Let's now assign descriptors to a structure. We do this with the same syntax as the composition featurizers. First, let's load a dataset containing structures.
df = load_dataset("phonon_dielectric_mp")
df.head()
Let's featurize the structures using DensityFeatures.
from matminer.featurizers.structure import DensityFeatures
densityf = DensityFeatures()
densityf.feature_labels()
We use featurize_dataframe() to generate these features for all the samples in the dataframe. Since we are using the structures as input to the featurizer, we select the "structure" column.
df = densityf.featurize_dataframe(df, "structure")
Conversion Featurizers¶
In addition to the Bandstructure/DOS/Structure/Composition featurizers, matminer also provides a featurizer interface for converting between pymatgen objects (e.g., assigning oxidation states to compositions) in a fault-tolerant fashion. These featurizers are found in matminer.featurizers.conversions and work with the same featurize()/featurize_dataframe() syntax as the other featurizers.
The dataset we loaded previously only contains a formula column with string objects. To convert this data into a composition column containing pymatgen Composition objects, we can use the StrToComposition conversion featurizer on the formula column.
from matminer.featurizers.conversions import StrToComposition
stc = StrToComposition()
df = stc.featurize_dataframe(df, "formula")
We can see that a composition column has been added to the dataframe.
df.head()
Advanced capabilities¶
Featurizers have some powerful capabilities that are worth quickly mentioning before we practice (and many more not mentioned here).
Dealing with Errors¶
Often, data is messy and certain featurizers will encounter errors. Set ignore_errors=True in featurize_dataframe() to skip errors; if you'd also like the errors returned in an additional column, set return_errors=True as well.
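Conceptually, ignore_errors/return_errors wrap each featurize call in error handling, similar to this plain-Python sketch (an illustration of the behavior, not matminer's actual implementation):

```python
import math

def safe_featurize(featurize, samples, return_errors=False):
    """Apply `featurize` to each sample; on failure, record NaN
    (and optionally the error message) instead of raising."""
    rows = []
    for sample in samples:
        try:
            rows.append({"feature": featurize(sample), "error": None})
        except Exception as exc:
            rows.append({"feature": math.nan, "error": str(exc)})
    if not return_errors:
        rows = [{"feature": row["feature"]} for row in rows]
    return rows

# The second sample triggers a ZeroDivisionError, but featurization continues
rows = safe_featurize(lambda x: 1 / x, [2, 0, 4], return_errors=True)
print(rows[1]["error"])  # the error message is recorded instead of crashing
```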
Citing the authors¶
Many featurizers are implemented using methods found in peer-reviewed studies. Please cite these original works using the citations() method, which returns the BibTeX-formatted references in a Python list (e.g., ef.citations() for the ElementFraction featurizer used above).
Part 3: Machine learning models¶
In parts 1 and 2, we demonstrated how to download a dataset and add machine learnable features. In part 3, we show how to train a machine learning model on a dataset and analyze the results.
Scikit-Learn¶
This unit makes extensive use of the scikit-learn package, an open-source Python package for machine learning. Matminer has been designed to make machine learning with scikit-learn as easy as possible. Other machine learning packages exist, such as TensorFlow, which implement neural network architectures. These packages can also be used with matminer but are outside the scope of this workshop.
Load and prepare a pre-featurized model¶
First, let's load a dataset that we can use for machine learning. In advance, we've added some composition and structure features to the elastic_tensor_2015 dataset used in exercises 1 and 2.
import os
from matminer.utils.io import load_dataframe_from_json
df = load_dataframe_from_json(os.path.join("resources", "elastic_tensor_2015_featurized.json"))
df.head()
We will use the Voigt-Reuss-Hill average of the bulk modulus (K_VRH) as the target property. We use the values attribute of the dataframe to extract the target properties as a numpy array, rather than a pandas Series object.
y = df['K_VRH'].values
print(y)
Next, we remove the K_VRH column from the set of features, as the model should not know the target property in advance. The dataset loaded above includes structure, formula, and composition columns that were previously used to generate the machine-learnable features. Let's remove them using the pandas drop() function, discussed in part 1. Remember, axis=1 indicates we are dropping columns rather than rows.
X = df.drop(["structure", "formula", "composition", "K_VRH"], axis=1)
We can check how many descriptors we have using the columns attribute.
print("There are {} possible descriptors:".format(len(X.columns)))
print(X.columns)
Try a random forest model using scikit-learn¶
The scikit-learn library makes it easy to use our generated features for training machine learning models. It implements a variety of different regression models and contains tools for cross-validation.
In the interests of time, we will only trial a single model in this example, but it is good practice to trial multiple models to see which performs best for your machine learning problem. A good "starting" model is the random forest. Let's create a random forest model.
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=100, random_state=1)
Here, the number of trees in the forest (n_estimators) is set to 100. n_estimators is an example of a machine learning hyper-parameter. Most models contain many tunable hyper-parameters. To obtain good performance, it is necessary to fine-tune these parameters for each individual machine learning problem. There is currently no simple way to know in advance which hyper-parameters will be optimal; usually, a trial-and-error approach is used. We can now train our model to use the input features (X) to predict the target property (y). This is achieved using the fit() function.
rf.fit(X, y)
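As an aside, the trial-and-error search over hyper-parameters mentioned above can be automated with scikit-learn's GridSearchCV, which cross-validates every combination in a parameter grid. A minimal sketch on synthetic data (not the elastic dataset loaded earlier):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data: 100 "materials" with 3 descriptors each
rng = np.random.RandomState(0)
X_toy = rng.rand(100, 3)
y_toy = X_toy.sum(axis=1)

# Cross-validate each candidate n_estimators value and keep the best
grid = GridSearchCV(
    RandomForestRegressor(random_state=1),
    param_grid={"n_estimators": [10, 50, 100]},
    cv=3,
)
grid.fit(X_toy, y_toy)
print(grid.best_params_)
```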
Evaluating model performance¶
Next, we need to assess how the model is performing. To do this, we first ask the model to predict the bulk modulus for every entry in our original dataframe.
y_pred = rf.predict(X)
We use scikit-learn's mean_squared_error() function to calculate the mean squared error. We then take the square root of this to obtain our final performance metric (the root mean squared error, RMSE).
import numpy as np
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y, y_pred)
print('training RMSE = {:.3f} GPa'.format(np.sqrt(mse)))
Cross validation¶
To obtain a more accurate estimate of prediction performance and validate that we are not over-fitting, we need to check the cross-validation score rather than the fitting score.
In cross-validation, the data is partitioned randomly into \(n\) "splits" (in this case 10), each containing roughly the same number of samples. The model is trained on \(n-1\) splits (the training set) and the model performance evaluated by comparing the actual and predicted values for the final split (the testing set). In total, this process is repeated \(n\) times, such that each split is at some point used as the testing set. The cross-validation score is the average score across all testing sets.
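To see the partitioning concretely, here is KFold applied to a toy set of 10 sample indices (using 5 splits for brevity; the lesson below uses 10 splits on the real data):

```python
from sklearn.model_selection import KFold

# Partition 10 samples into 5 train/test splits; each sample
# appears in exactly one test set across the 5 splits
kf = KFold(n_splits=5, shuffle=True, random_state=1)
for i, (train_idx, test_idx) in enumerate(kf.split(range(10))):
    print(f"split {i}: train={train_idx.tolist()} test={test_idx.tolist()}")
```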
There are a number of ways to partition the data into splits. In this example, we use the KFold method and select the number of splits to be 10. That is, 90% of the data will be used as the training set, with 10% used as the testing set.
from sklearn.model_selection import KFold
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
We set random_state=1 to ensure every attendee gets the same answer for their model. Finally, obtaining the cross validation score can be automated using the scikit-learn cross_val_score() function. This function takes a machine learning model, the input features, and the target property as arguments. Note that we pass the kfold object as the cv argument so that cross_val_score() uses the correct test/train splits.
For each split, the model is trained from scratch before its performance is evaluated. As we have to train and predict 10 times, cross validation can often take some time to perform. In our case, the model is quite small, so the process only takes about a minute. The final cross validation score is the average across all splits.
from sklearn.model_selection import cross_val_score
scores = cross_val_score(rf, X, y, scoring='neg_mean_squared_error', cv=kfold)
rmse_scores = [np.sqrt(abs(s)) for s in scores]
print('Mean RMSE: {:.3f}'.format(np.mean(rmse_scores)))
Visualizing model performance¶
We can visualize the predictive performance of our model by plotting our predictions against the actual values for each sample in the test set, across all test/train splits. First, we get the predicted values of the testing set for each split using the cross_val_predict method. This is similar to the cross_val_score method, except it returns the actual predictions rather than the model score.
from sklearn.model_selection import cross_val_predict
y_pred = cross_val_predict(rf, X, y, cv=kfold)
We will plot the results using the PlotlyFig module of matminer, which helps you quickly produce publication-ready figures. PlotlyFig can produce many different types of plots. Explaining its use in detail is outside the scope of this tutorial, but examples of the available plots are provided in the FigRecipes section of the matminer_examples repository.
from matminer.figrecipes.plot import PlotlyFig
pf = PlotlyFig(x_title='DFT (MP) bulk modulus (GPa)',
y_title='Predicted bulk modulus (GPa)',
mode='notebook')
pf.xy(xy_pairs=[(y, y_pred), ([0, 400], [0, 400])],
labels=df['formula'],
modes=['markers', 'lines'],
lines=[{}, {'color': 'black', 'dash': 'dash'}],
showlegends=False)
Model interpretation¶
An important aspect of machine learning is being able to understand why a model is making certain predictions. Random forest models are particularly amenable to interpretation as they possess a feature_importances_ attribute, which contains the importance of each feature in deciding the final prediction. Let's look at the feature importances of our model.
rf.feature_importances_
Let's use PlotlyFig to plot the importances of the five most important features.
importances = rf.feature_importances_
included = X.columns.values
indices = np.argsort(importances)[::-1]
pf = PlotlyFig(y_title='Importance (%)',
title='Feature importances',
mode='notebook')
pf.bar(x=included[indices][0:5], y=importances[indices][0:5])
Part 4: Automated machine learning using automatminer¶
Automatminer is a package for automatically creating ML pipelines using matminer's featurizers, feature reduction techniques, and Automated Machine Learning (AutoML). Automatminer works end to end - raw data to prediction - without any human input necessary.
**Put in a dataset, get out a machine that predicts materials properties.** Automatminer is competitive with state-of-the-art hand-tuned machine learning models across multiple domains of materials informatics. Automatminer also includes utilities for running MatBench, a materials science ML benchmark. **Learn more about Automatminer and MatBench from the [official documentation](http://hackingmaterials.lbl.gov/automatminer/).**
### How does automatminer work?
Automatminer automatically decorates a dataset using hundreds of descriptor techniques from matminer's descriptor library, picks the most useful features for learning, and runs a separate AutoML pipeline. Once a pipeline has been fit, it can be summarized in a text file, saved to disk, or used to make predictions on new materials.
Materials primitives (e.g., crystal structures) go in one end, and property predictions come out the other. MatPipe handles the intermediate operations such as assigning descriptors, cleaning problematic data, data conversions, imputation, and machine learning.
#### MatPipe is the main Automatminer object
`MatPipe` is the central object in Automatminer. It has an sklearn BaseEstimator syntax for `fit` and `predict` operations. Simply `fit` on your training data, then `predict` on your testing data.
#### MatPipe uses [pandas](https://pandas.pydata.org) dataframes as inputs and outputs
Put dataframes (of materials) in, get dataframes (of property predictions) out.
### Overview
In this section, we walk through the basic steps of using Automatminer to train and predict on data. We'll also view the internals of our AutoML pipeline using Automatminer's API.
* First, we'll load a dataset of ~4,600 refractive indices from the Materials Project.
* Next, we'll fit an Automatminer `MatPipe` (pipeline) to the data.
* Then, we'll predict refractive indices from the structure, and see how our predictions do (note, this is not an easy problem!).
* We'll examine our pipeline with `MatPipe`'s introspection methods.
* Finally, we look at how to save and load pipelines for reproducible predictions.
*Note: for the sake of brevity, we will use a single train-test split in this notebook. To run a full Automatminer benchmark, see the documentation for `MatPipe.benchmark`.*
### Preparing a dataset for machine learning
Let's load a dataset to play around with. For this example, we will use matminer to load one of the MatBench v0.1 datasets.
df = load_dataset("matbench_dielectric")
"structure" and "n" (dielectric constant) columns are present.df.head()
from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(df, test_size=0.2, shuffle=True, random_state=20191014)
Our target variable is "n", the refractive index.
target = "n"
prediction_df = test_df.drop(columns=[target])
prediction_df.head()
Fitting and predicting with Automatminer's MatPipe¶
Now we have everything we need to start our AutoML pipeline. For simplicity, we will use a MatPipe preset. MatPipe is highly customizable and has hundreds of configuration options, but most use cases will be satisfied by using one of the preset configurations. We use the from_preset method.
In this example, in the interests of time, we'll use the "debug" preset, which spends approximately 1.5 minutes doing machine learning. The "express" preset is a good choice if you have more time available.
from automatminer import MatPipe
pipe = MatPipe.from_preset("debug")
Fitting the pipeline¶
To fit an Automatminer MatPipe to the data, pass in your training data and desired target.
pipe.fit(train_df, target)
Predicting new data¶
Our MatPipe is now fit. Let's predict our test data with MatPipe.predict. This should only take a few minutes.
prediction_df = pipe.predict(prediction_df)
Examine predictions¶
MatPipe places the predictions in a column called "{target} predicted":
prediction_df.head()
Score predictions¶
Now let's score our predictions using the mean absolute error, and compare them to a DummyRegressor from sklearn.
from sklearn.metrics import mean_absolute_error
from sklearn.dummy import DummyRegressor
# fit the dummy
dr = DummyRegressor()
dr.fit(train_df["structure"], train_df[target])
dummy_test = dr.predict(test_df["structure"])
# Score dummy and MatPipe
true = test_df[target]
matpipe_test = prediction_df[target + " predicted"]
mae_matpipe = mean_absolute_error(true, matpipe_test)
mae_dummy = mean_absolute_error(true, dummy_test)
print("Dummy MAE: {}".format(mae_dummy))
print("MatPipe MAE: {}".format(mae_matpipe))
Examining the internals of MatPipe¶
You can inspect MatPipe internals with a dict/text digest from either MatPipe.inspect (a long, comprehensive listing of all attribute names) or MatPipe.summarize (an executive summary).
import pprint
# Get a summary and save a copy to json
summary = pipe.summarize(filename="MatPipe_dielectric_summary.json")
pprint.pprint(summary)
# Explain the MatPipe's internals more comprehensively
details = pipe.inspect(filename="MatPipe_dielectric_details.json")
print(details)
Access MatPipe's internal objects directly.¶
You can access MatPipe's internal objects directly, instead of via a text digest; you just need to know which attributes to access. See the online API docs or the source code for more info.
# Access some attributes of MatPipe directly, instead of via a text digest
print(pipe.learner.best_pipeline)
print(pipe.autofeaturizer.featurizers["composition"])
print(pipe.autofeaturizer.featurizers["structure"])
Persistence of pipelines¶
Being able to reproduce your results is a crucial aspect of materials informatics. MatPipe provides methods for easily saving and loading entire pipelines for use by others.
Save a MatPipe for later with MatPipe.save. Load it with MatPipe.load.
filename = "MatPipe_dielectric.p"
pipe.save(filename)
pipe_loaded = MatPipe.load(filename)